Fast multiple sequence alignment via multi-armed bandits

Abstract Summary Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. Availability and implementation The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Introduction
Multiple sequence alignment (MSA) is a central problem in computational biology with applications that include phylogeny inference (Morrison and Ellis 1997), detection of remote homology between protein sequences, protein structure and function inference (Bork andKoonin 1998, Ju et al. 2021), and DNA data storage (Antkowiak et al. 2020).While significant progress in MSA algorithms has been made in recent years, achieving high alignment accuracy on very large datasets in a computationally efficient manner remains a challenge.
One algorithm that has been shown to produce highquality alignments on large datasets is ultra-large alignments using phylogeny-aware profiles (UPP) (Nguyen et al. 2015).In particular, UPP has been shown to produce higher-quality alignments than other algorithms on large datasets with high levels of sequence length heterogeneity, while giving similar levels of performance on large datasets with little sequence length heterogeneity.While UPP gives improved alignment accuracy on large datasets, it is often slower than other widely used software packages such as MUSCLE (Edgar 2004), MAFFT (Katoh and Toh 2007), and Clustal-Omega (Sievers et al. 2011).
At a high level, UPP begins by creating an initial alignment and a maximum likelihood (ML) tree from a subset of the input sequences called backbone sequences.These backbone sequences are selected randomly from the set of input sequences that are close to the median input sequence length.All sequences that are not part of the backbone are called query sequences.The ML tree is then decomposed to form sets of related sequences.For each of these sets of sequences, a hidden Markov model (HMM) is formed from its multiple alignment using HMMer (Finn et al. 2011).This yields an ensemble of HMMs, as illustrated in Fig. 1.Next, each query sequence is assigned to the HMM that has the highest probability of generating it.For each HMM, the assigned query sequences are added to the alignment corresponding to the HMM using HMMer, one by one.The resulting alignment for each HMM is then merged with the backbone alignment, producing an MSA for the full set of sequences.
For large datasets, the query-to-HMM assignment step is by far the most time-consuming task in UPP.This is because, for each query sequence and each HMM, the probability of the HMM producing the sequence is calculated in Oð' 2 Þ time where ' is the (maximum) input sequence length.If there are n query sequences and m HMMs, the query-to-HMM assignment step takes Oðnm' 2 Þ time.In a recent work (Park et al. 2023), a new algorithm named UPP2 (Park et al. 2023) was designed to speed up the query-to-HMM assignment step.For each query sequence, UPP2 only computes the probability for certain HMMs, chosen according to the structure of the ML tree.This reduces the run-time to Oðn logðmÞ' 2 Þ, which leads to very substantial time savings, at the price of a small decrease in alignment quality.
In this work [originally developed independently and without knowledge of Park et al. (2023)], we pursue a different route to speed up the query-to-HMM assignment step.Rather than reducing how many HMMs each query sequence is compared against, we reduce how much computation is spent in each query-to-HMM comparison.To do so, we introduce two algorithmic ideas: 1) A new k-mer-based similarity score Jðq; hÞ that works as a proxy for the probability that a query sequence q was generated by HMM h.We refer to Jðq; hÞ as the J-score.Notably, Jðq; hÞ can be efficiently estimated by sampling k-mers in time sublinear in '. 2) We leverage the fact that Jðq; hÞ can be estimated using random k-mer samples to propose an adaptive estimation framework for finding arg max h Jðq; hÞ.We take inspiration from the recent literature on using Multi-Armed Bandits (MABs) to speed up large-scale computations via adaptivity (Bagaria et al. 2018, 2021, Heckel et al. 2018, Tiwari et al. 2020, Kamath et al. 2020).
An overview of the adaptive search for arg max h Jðq; hÞ is shown in Fig. 2. By drawing random subsets of k-mers, estimates of the score Jðq; hÞ can be efficiently computed.This allows for iterative refinement of estimates of Jðq; hÞ for more promising HMMs.Building on theoretical results for MABs, we show that using the Upper Confidence Bound algorithm (Lai and Robbins 1985), it is possible to identify arg max h Jðq; hÞ with high probability in time Oðmn log mÞ.However, for our practical implementation, we opt for an algorithm based on the Sequential Halving MAB algorithm (Karnin et al. 2013).This implementation runs in time Oðmn þ m'Þ and achieves very good performance.In particular, when used in the UPP pipeline, it reduces the overall runtime substantially for datasets containing long sequences (with similar alignment accuracy), even when compared to UPP2.

Adaptive search for best HMM
As described in Section 1, our approach for accelerating UPP is based on a new similarity metric, the J-score, which admits an adaptive search for arg max h Jðq; hÞ.In Sections 2.1 and 2.2, we first introduce the J-score and then we describe the adaptive search based on sequential halving.

J-score
We first introduce some notation.For a sequence s, let jsj denote the length of s.A k-mer of a string s is simply a length-k substring of s.For a given sequence s, let N k ðsÞ be the set of k-mers in s.For a set S of sequences, let N k ðSÞ ¼ [ s2S N k ðsÞ.For a sequence s and k-mer a, let c s ðaÞ be the number of times a appears in s.For a set of sequences S, let c S ðaÞ ¼ 1 jSj P s2S c s ðaÞ.Let n be the number of query sequences, m be the number of HMMs, and ' be the maximum length over all sequences in the dataset.
In the original UPP pipeline (Nguyen et al. 2015), each of the HMMs is built from a subset of the backbone sequences.For an HMM h, we let S h be the subset of backbone sequences used by UPP to create the HMM (which is done using HMMer; Finn et al. 2011).Our similarity score Jðs; hÞ can be thought of as a kind of weighted Jaccard similarity (Jaccard 1912) between the k-mers in q and in S h .We formally define it as Jðq; hÞ ¼ P a2N k ðqÞ minðc q ðaÞ; c S h ðaÞÞ ðjqj−kþ1Þþ Each k-mer a that appears in both q and S h , contributes an additive term of minðc q ðaÞ; c S h ðaÞÞ to the numerator in (1).This can be thought of as a kind of intersection between the k-mers of q and the k-mers of an "average" of the sequences in S h (since c S h has a normalization factor of jS h j).The denominator is simply the number of k-mers in q plus the average number of k-mers in S h .The J-score is inspired by the k-mer Jaccard similarity and its usefulness in estimating pairwise sequence alignment scores (Berlin et al. 2015, Jain et al. 2018, Kamath et al. 2020).In particular, the J-score is equivalent to the multiset Jaccard similarity (Rajaraman and Ullman 2011), except that each k-mer can appear a rational number of times in a multiset.We propose to perform the query-to-HMM assignment based on the J-score, i.e. assigning query q to h � ¼ arg max h Jðq; hÞ; (2) instead of doing this assignment based on the bit-score (which corresponds to the probability of the HMM generating the query sequence), which is employed in UPP.As we verify empirically (see Fig. 3), the J-score is roughly monotonically increasing in the bit-score (which is the score UPP utilizes to perform the query-to-HMM assignment).This monotonic trend tends to hold particularly well for larger values of J-score/bit-score, which is what is important when trying to choose arg max h Jðq; hÞ.
We analyze the relationship between J-score and bitscore in detail for the 16S.3 and 16S.T nucleotide datasets from the Comparative Ribosomal Website (Cannone et al. 2002), and the adh amino acid dataset from homfam (Sievers et al. 2011).The 16S.3 dataset has 6323 sequences.Using a backbone size of 1000, UPP produces 271 HMMs.The UPP algorithm therefore compares 5323 sequences to 271 HMMs.The 16S.3 dataset has 7350 sequences, and the adh dataset High-level description of the UPP pipeline.The input sequences are split into two parts, the backbone sequences and the query sequences.An alignment and tree are estimated for the backbone sequences, and an ensemble of HMMs is constructed based on the backbone alignment and tree.This is followed by a query-to-HMM assignment step, which in principle requires computing the probability that each HMM could have generated each query sequence.

Figure 2.
Adaptive search for the HMM h i , i ¼ 1; . . .; m, that maximizes Jðq; h i Þ.We first estimate the similarity score Jðq; hÞ for each HMM based on a random k-mer batch, and discard HMMs with a low score.The score for the remaining HMMs is refined based on a new k-mer batch, and this process can be repeated.In the end, the exact value of Jðq; hÞ is computed for a small number of HMMs, and the best one is chosen.
Fast multiple sequence alignment via multi-armed bandits contains 21 331 sequences.The average sequence lengths of 16S.3, 16S.T and adh are 1492, 1557, and 124, respectively.We define d 0 to be the fraction of query sequences where the top-scoring HMM according to bitscore is the top-scoring HMM according to J-score.We define d x to be the fraction of query sequences where the top scoring HMM according to bitscore is among the top x percent of HMMs according to J-score.For each query sequence, we also compute the ordering of the HMMs according to bitscore, and the ordering of the HMMs according to J-score.We then compute the Spearman's rank correlation coefficient (Spearman 2015) between the bitscore HMM ordering and the J-score HMM ordering, along with its associated P-value.This correlation coefficient measures how well the relationship between the J-score and the bitscore can be described using a monotonic function.The Spearman coefficient ranges from −1 to 1, with 0 implying no correlation, and 1 implying an exact monotonic relationship.We report this information for the three datasets in Table 1.Observe that in Table 1, the value of k that causes the J-score to correlate best with bitscore is lower for the amino acid dataset adh compared to the two nucleotide datasets.This is because it is harder for the short amino acid query sequences in adh to share long k-mers with the backbone sequences used to form the HMMs, and thus yield non-zero J-scores.It is harder for adh query sequences to share long k-mers with the backbone sequences because there are 20 amino acids as opposed to four nucleotides and because the sequences in adh are much shorter and therefore have far less kmers than sequences in the 16S datasets.When a query sequence does not share any kmers with any backbone sequences, the J-scores are zero for all HMMs, and thus do not correlate well with the bitscores (which are generally not constant across HMMs).
We note that the J-score can be computed naively for all pairs of query sequences and HMMs in amortized Oðnm'Þ time by building a hash table for q that maps each k-mer a present in q to c s ðaÞ, and building a hash table that maps each k-mer a present in S h to c S h ðaÞ.The summation in the numerator in (1) can then be computed in amortized Oð'Þ time.
While the specific form of the score in (1) may seem arbitrary, our main motivation for working with it is that it allows for the computation of unbiased estimates of Jðq; hÞ from randomly selected k-mers from N k ðqÞ.Let P q be a distribution that chooses each k-mer in the set N k ðqÞ with equal probability (i.e., with probability P q ðaÞ ¼ jN k ðqÞj −1 for a 2 N k ðqÞ).For a batch of k-mers B ¼ fa 1 ; . . .; a B g of size B, drawn i.i.d.according to P q , one can build an estimator This is an unbiased estimator because   a The backbone sizes used are 1000 for all datasets.The Spearman coefficients and corresponding P-values are averaged across all query sequences in a dataset.The Spearman coefficient and corresponding P-value is not defined for a query sequence when the J-scores for that query sequence are equal to some constant for all HMMs.Therefore, if a query sequence does not share any k-mers with any backbone sequences, the J-score is 0 for all HMMs, and the corresponding Spearman Coefficient is not defined.If the Spearman Coefficient is not defined, we set the Spearman Coefficient to 0, and set the P-value to 1 to be as adversarial as possible when computing the averages in the table.

i330
Mazooji and Shomorony Notice that, unlike for the J-score, the standard approach for estimating the Jaccard similarity jA\Bj jA[Bj between sets A and B is the use of min-hashes (Broder 1997, Broder et al. 2000, Berlin et al. 2015).What makes Jðq; hÞ somewhat different from the Jaccard similarity is the fact that the denominator of Jðq; hÞ does not require a set union calculation, it is just a function of sequence length.Thus, we only need to estimate the numerator of Jðq; hÞ from samples.

Adaptive search via multi-armed bandits
Because we have an unbiased estimator for Jðq; hÞ based on samples from N k ðqÞ, we can search for h � ¼ arg max h Jðq; hÞ adaptively by iteratively sampling more k-mers from N k ðqÞ in order to refine the estimate of Jðq; hÞ for more promising HMM candidates h.Our goal is then to minimize the number of times we need to evaluate Jðq; h; fagÞ for a k-mer a in N k ðqÞ.We refer to the evaluation of Jðq; h; fagÞ as a "k-mer evaluation" on h.
The problem of finding h � ¼ arg max h Jðq; hÞ while minimizing the total number of k-mer evaluations fits well within the MAB literature.In the MAB setting, there are several random variables (referred to as "arms"), and at each time step, we can sample one of the random variables (or "pull an arm").In the best-arm identification problem (Jamieson and Nowak 2014), the goal is to identify the arm with the largest mean reward (with high probability) using as few arm pulls as possible.In our problem, each arm corresponds to an HMM, and pulling arm h corresponds to sampling a k-mer a from N k ðqÞ uniformly at random, and evaluating Jðq; h; fagÞ.This is a best-arm identification problem because we want to find h � ¼ arg max h Jðq; hÞ by performing as few k-mer evaluations as possible.
Two well-known algorithms for accomplishing this are Upper-Confidence Bound (UCB) (Lai and Robbins 1985) and Sequential Halving (Karnin et al. 2013).The UCB algorithm is widely used in the literature and is amenable to a clean theoretical analysis of the number of arm pulls needed to identify the best arm with high probability (Lattimore and Szepesv� ari 2020).Sequential Halving (Karnin et al. 2013) is simpler to implement and achieves great results in many practical settings (Baharav and Tse 2019), although its theoretical analysis is less straightforward.
For this reason, we first state a theoretical result characterizing the number of k-mer evaluations needed to identify the best HMM h � ¼ arg max h Jðq; hÞ when the UCB algorithm is applied to our problem, but use Sequential Halving in our software implementation due to its good performance in practice (Cazenave 2015, Pepels et al. 2016, Baharav and Tse 2019).We present the UCB-based algorithm as Algorithm 2 and its theoretical analysis in detail in Section 5.Under some regularity conditions (see Section 5), this analysis implies that the query-to-HMM assignment problem using J-scores can be solved very efficiently: Corollary 1 The optimal HMM h � in the search problem h � ¼ arg max h Jðq; hÞ can correctly identified in time Oðmn log mÞ with probability 1−oð1Þ.
While the UCB algorithm provides us with a time complexity that is independent of ', for our practical implementation we utilize a simpler adaptive algorithm that still has a linear dependence on '.The algorithm we implemented in the software is a modified Sequential Halving algorithm, and it is presented as Algorithm 1.Observe that Algorithm 1 takes OðmRBþT'Þ k-mer evaluations and OðmRBþT'Þ amortized time since we can pre-compute a hash-table mapping each kmer a 2 N k ðS i Þ to c Si ðaÞ for each i, along with the analogous map for q.Applying this algorithm to all n query sequences requires OðnmRBþTn'Þ amortized time.For R, B, T constant, the time complexity is Oðnmþn'Þ, which gives a better dependence on ' than UPP and UPP2.It also gives an improved run-time compared to a naive version of our algorithm that simply computes Jðq; hÞ for all q and h, which requires Oðnm'Þ time.In practice, we pick R, B, T depending on how confident we want to be in the selected h � .Note that if a query sequence does not share a k-mer with any sequence in any of the HMMs, it is assigned to the HMM corresponding to all sequences in the backbone.We also point out that we parallelized the algorithm to make use of a user-specified number of cores.All of our code is written in Python, and is available at: https://github.com/ilanshom/adaptiveMSA. The additional scripts used to generate the results in this paper are also available at this link.

Datasets and performance metrics
The first three nucleotide datasets we use are from the Comparitive Ribosomal Website (Cannone et al. 2002).They are named 16S.3,16S.T and 16S.B.ALL.These three biological datasets were used in the UPP and UPP2 papers.The next three nucleotide datasets we test on were generated by Indelible (Fletcher and Yang 2009), and were introduced in (Mirarab et al. 2014).They are named 10000M2, 10000M3, and 10000M4 and were used in the UPP paper.The final three nucleotide datasets we test are called RNASim10000, RNASim50000, RNASim100000, and RNASim200000 and were introduced in Mirarab et al. (2014).These simulated datasets were tested in the UPP paper.The first three amino acid datasets we tested were generated using ROSE (Stoye et al. 1998) (Sievers et al. 2011).The 19 datasets in HomFam used are as follows: aat, Acetyltransf, adh, aldosered, biotin_lipoyl, blmb, ghf13, gluts, hla, hom, myb_DNA-binding, p450, PDZ, Rhodanese, rrm, rvp, sdr, tRNA-synt_2b, zf-CCHH.Each of these biological datasets has a reference alignment for a very small subset of the sequences (5-20 sequences, median 7).This is in contrast to all other datasets, which have full reference alignments.Information on the number of sequences and average sequence length for each dataset is present in Table 2.Note that both 16S and homfam include datasets with high levels of sequence length heterogeneity, which UPP is known to handle well (Nguyen et al. 2015).All datasets were obtained from this website: https://sites.google.com/eng.ucsd.edu/datasets/alignment/pastaupp.
We report SP-error, SP-score, modeler-score, TC-score.In an MSA A, consider sequences q 1 and q 2 .The ith symbol in q 1 is said to be homologous to the jth symbol in q 2 if they appear in the same column in A. In this case, q 1 ½i� and q 2 ½j� are said to form a homologous pair.For a reference alignment A and an estimated alignment A 0 ; the SPFN rate is the fraction of homologous pairs in A that are not present in A 0 : The SPscore is defined as 1 -SPFN, and is a measure of recall.The SPFP rate is the fraction of homologous pairs in A 0 that are not present in A. The Modeler score is defined as 1 − SPFP and is a measure of precision.The SP error is equal to the average of the SPFN rate and the SPFP rate (Nguyen et al. 2015).We define the Total Column score (TC-score) as the number of columns in A 0 that are present in A, divided by the total number of columns in A. We use FastSP (Mirarab and Warnow 2011) to calculate all metrics.

Experiments
We test our algorithm, UPP, and UPP2 on all datasets mentioned above.UPP and UPP2 are already compared extensively with existing MSA packages, so we focus on comparing our algorithm with UPP and UPP2.Throughout this section, we refer to our modified version of UPP that makes use of Algorithm 1 to estimate the best HMM for each query sequence as J-bandit.We refer to the modified version of UPP that assigns each query sequence to the HMM with the highest J-score as J-exact.We run all algorithms by generating an alignment of the backbone sequences using PASTA (Mirarab et al. 2014(Mirarab et al. , 2015)).Using PASTA for this task is the only option included in UPP and UPP2.The backbone is a For all datasets, c ¼ 0:2; R ¼ 3; T ¼ 10 for the J-bandit runs.The number of sequences is in parentheses below dataset name, followed by average sequence length in parentheses [for homfam ( 19), these statistics are averaged over the 19 datasets].The times reported for homfam (19) do not include backbone generation.

i332
Mazooji and Shomorony selected randomly from the set of input sequences whose length is within 25% of the length of the median sequence length as is standard in UPP.For all algorithms in all experiments, we specified that 24 processors can be used.We ran all simulations on a machine with 80 physical cores, 160 threads, and 512 GB of memory.We report SP error, SP score, modeler score, TC score, and time taken by the algorithm from start to finish (including backbone generation).We also report peak memory usage for a subset of the datasets.We began by running J-exact for k values of 5, 10, 15, 20, and 25 on 16S.B.ALL, Indelible 10000M2, Indelible 10000M3, Indelible 10000M4, RNASim10000, ROSE 1000S1, ROSE 1000M1, and ROSE 1000L1.These datasets were chosen to include a mix of nucleotide and amino acid datasets.We did not include any homfam datasets in this experiment because they only have reference alignments for very small subsets of the sequences, and because the sequences are very short in comparison to the other datasets (the average of the average sequence lengths of the homfam datasets was 144, while the average sequence length of all other datasets was at least 1000).We made sure to include 16S.B.ALL in the experiment because it is a biological dataset as opposed to simulated, and because it displays substantial sequence length heterogeneity, which UPP is known to handle well (Nguyen et al. 2015).For the 16S, Indelible, and RNASim datasets, we used a backbone of 1000, while for ROSE datasets, we used a backbone of 100.We observed that for all performance metrics, setting k to 20 gave comparable performance to UPP, as shown in Fig. 4. We therefore set k to 20 in proceeding experiments, with the exception of the homfam datasets which have sequences of much shorter length than the other datasets.
Next, we ran J-bandit on 16S.B.ALL for a range of parameters in Algorithm 1 to observe their effect on performance and runtime.We chose 16S.B.ALL because it is one of the largest datasets we had in terms of the number of sequences and the sequence lengths, it has substantial sequence length heterogeneity, and because it is biological (as opposed to simulated).For the sequence q, the batch size B is chosen to be c � jqj where jqj is the length of q and c is a constant.We tested c values of 0.1, 0.2, and 0.3, and tested R values of 2, 3, and 4. We kept k fixed at 20, and T fixed at 10 for these experiments.The performance on 16S.B.ALL does not change much for the various values of c and R that we tested and remains close to the performance of UPP and J-Exact.The runtime does not seem to change significantly either across the parameter settings but is significantly lower than UPP and J-Exact.Based on these observations, we choose c to be 0.2, R to be 3, and T to be 10 for J-bandit in all of the proceeding runs of J-bandit.
Finally, we ran J-bandit on a wide range of datasets and compared its performance to UPP and UPP2.All datasets in Section 3.1 other than RNASim 200000 were tested and the SP error and all performance metrics along with time taken were calculated and are shown in Table 2.We did not compare the algorithms on RNASim 200000 because due to the fact that such a comparison would use excessive computing time: UPP used over 28 h to run on RNASim 100000 and the runtime of all three algorithms scales roughly linearly with the number of query sequences.For all datasets, J-bandit used parameters c ¼ 0:2, R ¼ 3, and T ¼ 10.For all datasets besides those in homfam ( 19), we set K to 20.For the 19 large homfam datasets, we set K to 10 because these sequences have a much shorter average length of 144, with many sequences shorter than 20 (the average sequence length of all other datasets was at least 1000).Note, however, that setting K to 20 for the 19 homfam datasets did not change the overall average performance by much.We used a backbone size of 1,000 for all datasets except for the three ROSE datasets since these datasets only have 1000 sequences.For these three ROSE datasets, we used a backbone size of 500 sequences.
To summarize the results in Table 2, J-bandit runs faster than UPP, creating alignments with similar, though often slightly degraded accuracy.Compared to UPP2, J-bandit sometimes gives improved accuracy and sometimes gives degraded accuracy depending on the dataset.Similarly, J-bandit is faster than UPP2 in most, but not all cases.For datasets with large sequence lengths and many sequences, e.g.16S datasets and RNASim datasets, J-bandit is significantly faster than both UPP and UPP2 as highlighted in Fig. 5.For example, for the RNASim-100000 dataset, J-bandit completes in 16.5% of the time used by UPP2, and only 7% of the time used by UPP.This is expected from a theoretical standpoint since both UPP and UPP2 calculate bitscore, which requires quadratic time in the sequence length ', whereas even exact computation of the J-score requires only linear amortized time in '.For datasets with shorter sequence lengths such as the three Indelible datasets, the three ROSE datasets, and the 19 homfam datasets, we observe the increase in speed of J-bandit is not as pronounced compared to UPP and UPP2.In some cases, J-bandit is even slower than UPP2 (e.g.Indelible 10000M3, 10000M4).This is not surprising, since UPP2 is designed to be faster than UPP, though unlike J-bandit, it is not designed to reduce the effect of sequence length on runtime.It should be noted that our algorithms are written in Python, while HMMer (Finn et al. 2011), which is used for making the assignment in UPP and UPP2, is written in C. Hence, it may be possible to accelerate our J-score-based approach even further by writing it in a low-level programming language like C.
We also performed a smaller experiment to assess peak memory usage on four large datasets.In Table 3, we show the peak memory usage of UPP, UPP2, and J-bandit on 16S.3, Indelible 10000M2, RNASim 10000, and ROSE 1000M1.We chose these four large datasets because we are most interested in memory usage when a lot of data need to be stored, in both the nucleotide and amino acid cases.We obtained these results using the "memory-profiler" Python package.The paramater settings used in these runs are identical to those used in the runs presented in Table 2.We observe that J-bandit has a higher peak memory usage than UPP for all datasets.In comparison to UPP2, J-bandit has a higher memory usage on some datasets and a lower memory usage on other datasets.UPP uses less memory than J-bandit due to the fact that J-bandit creates hash tables that store all k-mers in the query sequences and backbone sequences in order to efficiently estimate the J-score.In contrast, these hash tables are not created for UPP and UPP2.In addition, the memoryintensive computation of estimating and computing the Jscore is implemented in Python in J-bandit, whereas bitscore computation in UPP and UPP2 is performed by HMMer which is written in C, a language that generally uses less memory than Python.

Conclusion
In this work, we proposed a method to speed up the queryto-HMM assignment step in the UPP pipeline.This strategy is based on two key ingredients: the introduction of the J-score and an adaptive search algorithm inspired by Multi-Armed Bandit algorithms.This allows us to achieve theoretical and practical reductions in run-time when replacing the query-to-HMM module in UPP with our proposed approach.
While the techniques introduced were developed for the specific setting of the UPP pipeline, we believe that they may be of broader interest in bioinformatics since bitscores are used to choose the best HMM in many applications including orthology detection, and metagenomic pipelines.The J-score can be thought of as a kind of Jaccard similarity between a sequence and a set of sequences and can be easily generalized to measure similarity between two sets with different numbers of sequences.As we verified empirically, this score can be used as a proxy for the bit-score between a sequence and an HMM, in situations where exact calculation of the bit-score may not be needed.Finally, we point out that techniques from MAB may be applicable to other MSA pipelines.

Theoretical guarantees via the batched UCB algorithm
In this section, we describe how a version of the UCB algorithm (Lai and Robbins 1985) can be used to show that each query q can be assigned to the best HMM based on J-score in time Oðm log mÞ.In particular, we will use a batched version of the UCB algorithm (see, e.g.Tiwari et al. 2020), which is appropriate for the J-score refinement based on kmer batches.
The batched UCB algorithm adapted to our problem is given by Algorithm 2. Similar to the standard UCB algorithm (Lai and Robbins 1985), the algorithm assumes that for each HMM h and a random k-mer a, the random variable Jðq; h; fagÞ is σ-sub-Gaussian, and that the parameter σ (or an upper bound) is known.Recall that a random variable X is σ-sub-Gaussian if PrðX > tÞ ≤ 2 expð−t 2 =σ 2 Þ.Observe that Jðq; h; BÞ is trivially sub-Gaussian because it takes values in a finite set.In this case, an upper bound on the random variable Jðq; h; BÞ has a subgaussianity parameter 1 2 ðmax a2N k ðqÞ Jðq; h; fagÞ−min a2N k ðqÞ Jðq; h; fagÞÞ.An upper bound on this quantity that can be used in place of it in the algorithm is 1 2 max a2N k ðSiÞ c Si ðaÞ and can be found in a preprocessing step on the sets.Algorithm 2 works by maintaining a  a For all datasets, c ¼ 0:2; R ¼ 3; T ¼ 10 for the J-bandit runs.The number of sequences is in parentheses below dataset name, followed by average sequence length in parentheses.
set S active of active arms (HMMs), initialized as f1; . . .; mg.For each HMM h 2 S active , an estimate Ĵh of Jðq; hÞ is maintained.At each iteration, a random k-mer batch of size B is drawn (with replacement) from N k ðqÞ and the estimates Ĵh is updated for all h 2 S active .At the end of each iteration, we eliminate all h whose confidence interval does not intersect with the confidence interval of the current best candidate max y Ĵh .
Once one HMM is left in S active (or t used ≥m), we output it.Notice that this algorithm is similar to Algorithm 1, except that a more careful elimination criterion is used at the end of each round, based on confidence intervals.This allows us to obtain a theoretical guarantee for Algorithm 2. For h 2 ½1 : m�, let Δ h ¼ Jðq; h � Þ−Jðq; hÞ.Then we have Theorem 1 For δ ¼ m −3 , with probability at least 1− 2 δ , the algorithm returns the best HMM h � ¼ arg max h Jðq; hÞ using a total of M k-mer evaluations, where Notice that if σ=Δ h is Θð1Þ, then the algorithm finds h � with Oðm log mÞ k-mer evaluations.Since we can precompute a hash-table mapping each k-mer a 2 N k ðS i Þ to c Si ðaÞ along with an analogous map for q, we can find h � with high probability in Oðm log mÞ amortized time.Finding h � for all query sequences q results in an amortized run-time of Oðnm log mÞ, as we state in Corollary 1 in Section 2. This removes the dependence on ' completely (while UPP and UPP2 have a quadratic dependence on ') and also improves upon the naive exhaustive search algorithm that computes each Jðq; hÞ exactly in time Oðnm'Þ.
Proof.Notice that t used keeps track of how many k-mers have been used in the estimates Ĵh , for where the equality follows since C ¼ σ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 2 logð1=δÞ t used q .Due to the constraint t used <m in the while loop, at most m=B iterations occur, and at most mðm=BÞ≤m 2 estimates Ĵh are computed throughout the whole algorithm.Hence, from the union bound we have that ( 6) holds for all estimates with probability at most m 2 ð2δÞ.By setting δ ¼ 1=m 3 , we have that Jðq; hÞ 2 ½ Ĵh −C; Ĵh þC� for all h 2 S active in all iterations of the algorithm, with probability at least 1−m 2 ð2δÞ ¼ 1− 2 m .The fact that Jðq; hÞ 2 ½ Ĵh −C; Ĵh þC� for all h 2 S active implies that h � can never be eliminated and must be in S active at the end of the algorithm.Now consider some h 6 ¼ h � .Suppose ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 2 logðm 3 Þ=t used q ¼ 2σ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 2 logð1=δÞ=t used q ¼ 2C: (7)

Figure 1 .
Figure1.High-level description of the UPP pipeline.The input sequences are split into two parts, the backbone sequences and the query sequences.An alignment and tree are estimated for the backbone sequences, and an ensemble of HMMs is constructed based on the backbone alignment and tree.This is followed by a query-to-HMM assignment step, which in principle requires computing the probability that each HMM could have generated each query sequence.

Figure 3 .
Figure 3. Scatterplot of our proposed J-score when k ¼ 9 versus the bitscore (the score which UPP attempts to maximize).Each point corresponds to the scores for sequences 68 and 283 from the AMINO test dataset (included with UPP; Nguyen et al. 2015) and one of the HMMs created with a backbone of size 100.Observe the correlation between bitscore and J-score.

Figure 4 .
Figure 4. Performance metrics for various datasets when the J-score is computed exactly for a range of k.The "þ" symbols correspond to the performance metric for UPP for the dataset corresponding to the symbol's color.

Figure 5 .
Figure 5.Time (in seconds) taken for several datasets with long sequence lengths.

Table 2 .
Results for all datasets.a